Fall 2022 Data Science Intern Challenge

Candidate Name - Rutvik Vikas Raut

Question 1 - Solution Summary

a. Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.

b. What metric would you report for this dataset?

c. What is its value?

Importing the required libraries

Reading the dataset

Checking the datatypes of the columns in our dataset

Checking to see if there are any null values

Checking the number of unique values for each column in our dataset

Converting 'created_at' column to date-time datatype

Checking to see if any of the orders in the dataset fall outside the 30 day window

Calculating unit price for each order

Creating a copy of our dataframe to make corrections in case we spot anomalies

Exploratory Data Analysis

Sorting the unit price in descending order to detect any possible high value prices

We now know that there are some outliers in the dataset as we have a few orders with extremely high per unit price.

Let's check some stats for the order amount column

Here we can see that the mean value is very high compared to the median value, which proves that due to a few outliers the mean or average order value has been increased.

Let's group the unit price by shop id to target the shops with high order values

From the graphs and data, it is now clear that shops 78 and 42 have high unit prices, let's examine them one by one.

From the above analysis, we can see that for all the users that have made purchases from shop 78, they have been charged $25725 per sneaker, which is unusually high amount.

There are chances that the price of the sneaker might be $257.25, and there might be a typing mistake while entering this data.

But we are not sure of this information, hence we will replace this high value amount by the median unit price of the dataset.

And we will also calculate the updated order amount.

Let's now check the anomalies in shop 42

It looks like there is an issue with only user 607 who has purchased sneakers from shop 42. Let's filter out only for this specific user.

We can see that user 607 has shopped for 2000 sneakers every time, which is an unsually high amount for a single person.

This user might be another reseller of the sneakers or this user might have been wrongly entered in this dataset from the sellers database.

But as we are not sure what is the reason for this anomaly, we will replace this high amount with the median number of items in the dataset.

As well as we will update total order amount by calculating from the unit price and updated total items.

Comparing the stats of our corrected dataset with the original dataset, we can see that the mean is now much closer to the median value.

Once again visualizing the corrected dataset to see if there are any more outlier values.

All the values now seem to be in reasonable ranges.

Question 2

a. How many orders were shipped by Speedy Express in total?



SELECT COUNT(DISTINCT OrderID) FROM Orders o


INNER JOIN Shippers s


ON o.ShipperID = s.ShipperID


WHERE s.ShipperName = 'Speedy Express'


Solution a -

54

--------------------------------------------------------------------------------------------------------------------------------------------------------------

b. What is the last name of the employee with the most orders?



SELECT e.LastName, COUNT(o.OrderID) AS Number_Of_Orders


FROM Orders o


LEFT JOIN Employees e


ON o.EmployeeID = e.EmployeeID


GROUP BY LastName


ORDER BY Number_Of_Orders DESC


LIMIT 1


Solution b -

LastName - Peacock

Number_Of_Orders - 40

--------------------------------------------------------------------------------------------------------------------------------------------------------------

c. What product was ordered the most by customers in Germany?



SELECT c.Country, od.ProductID, p.ProductName, SUM(od.Quantity) AS Total_Order


FROM Customers c, OrderDetails od, Orders o, Products p


WHERE c.CustomerID = o.CustomerID


AND od.ProductID = p.ProductID


AND o.OrderID = od.OrderID


AND c.Country = 'Germany'


GROUP BY ProductName


ORDER BY Total_Order DESC


LIMIT 1


Solution c -

Country - Germany

ProductID - 40

ProductName - Boston Crab Meat

Total_Order - 160

--------------------------------------------------------------------------------------------------------------------------------------------------------------